of effective and robust methods for persistent target tracking and acquisition in unconstrained environments. As a result, it remains challenging, though desirable, to build long-duration target tracking systems for many emerging applications, e.g., video surveillance and sensor networks.
1. Objective

Extending sensing capacity in target acquisition and tracking is very important for the Army's future force. Video sensors provide both opportunities and challenges. On the one hand, they bring rich and detailed information that facilitates identifying and recognizing targets. On the other hand, such visual information is subject to large uncertainties and ambiguities induced by factors such as cluttered and distracting backgrounds, illumination changes, visual occlusion, and low image quality. All these difficulties have impeded the development of effective and robust methods for persistent target tracking and acquisition in unconstrained environments. As a result, it remains challenging, though desirable, to build long-duration target tracking systems for many emerging applications, e.g., video surveillance and sensor networks.

The goal of the proposed project is to overcome the challenges that confront persistent target tracking and acquisition by developing new and effective computational models and methods that bridge computational sensing and aspects of human perception of visual dynamics, and by creating multi-level synergy among various modalities and sensors.


2. Approach

We propose novel research on collaborative and persistent target tracking and acquisition that creates a multi-level synergism to address the difficulties mentioned above: visual clutter and occlusions in video-based tracking, robustness and persistency, target re-acquisition and association in distributed video sensors, and information inconsistency in fusion.

• The synergetic selective attention (SSA) model. Motivated by psychological findings, we propose a new computational model for collaborative and persistent target acquisition and tracking. The model has four components: early attentional selection, collaborative tracking, context-aware learning and late selection, and robust information integration. It is fundamentally different from existing models for target tracking, and it enables multi-level synergism in a coherent way.

• Synergy in selective visual attention to handle visual clutter and occlusions. In the SSA model, collaboration among the selected attentional regions provides resilience to visual clutter and occlusions.

• Synergy in contextual visual analysis for robustness and persistency in visual tracking. The SSA model accommodates the visual context of the target of interest, and context-aware learning in the proposed approach produces robust and persistent visual tracking.

• Synergy in multiple distributed cameras for target re-acquisition and association. The visual selection and attention mechanism in the SSA model also enables target re-acquisition and association in a camera network.

• Synergy in robust information fusion for integrating multiple sources and multimodal sensors. The robust information integration mechanism in the SSA model is able to handle information inconsistency among sensors of different modalities.

One important innovation of this project is introducing findings from physiological and psychological vision studies into computational models for robust and persistent target tracking and acquisition. To the best of our knowledge, this research is a pioneering attempt of its kind. Most existing video-based methods face many challenges in practice because they oversimplify the complicated process of human perception of visual dynamics and ignore powerful mechanisms of human visual perception, such as selective attention and context-awareness. To our knowledge, the proposed computational model is the first to incorporate selective attention and context-awareness in target tracking, and it is expected to yield more powerful tracking methods.

Another innovation is the multi-level synergism that integrates the analysis of sensory data at four levels. First, we represent the visual appearance of the target by its local invariant regions (called attentional regions). Synergy at the visual region level enables selective visual attention and overcomes the difficulties of cluttered backgrounds and partial occlusions. Second, tracking concentrates not only on the target itself but also incorporates other objects in its spatial context. Such contextual synergy at the object level greatly enhances robustness and persistence in long-duration tracking. Third, synergy among distributed cameras facilitates robust data association and target re-acquisition. Last, synergy among multimodal sensors enables robust information fusion for more accurate target tracking and acquisition. To our knowledge, such a unification has not been reported in the literature.

3. Scientific Barriers

The major challenge in video-based target tracking and acquisition lies in the large uncertainties and ambiguities in video data. The specific scientific barriers in this research lie in the following four aspects:

• Uncertainties induced by visual clutter and occlusions. The major hurdles in video-based target tracking include background clutter and occlusions. Clutter generates false-positive matches that distract the tracker, and occlusions conceal the visual observations needed for matching. Detecting and locating the target in clutter and handling occlusions remain very challenging tasks.

• Robustness and persistence in long-duration target tracking. This is not trivial, as the target's visual appearance may undergo large and unexpected changes due to changes in view, pose, scale, and illumination, as well as interference from the environment. Truly intelligent and adaptive approaches are desirable.

• Target re-acquisition and association in a network of video sensors. It is important to know when and where the targets have been seen by different cameras, i.e., to associate their tracks and to re-acquire them in other cameras. This problem has not yet been well investigated.

• Inconsistency in multimodal sensor fusion. Besides video sensors, other imaging and non-imaging sensors may be used. Integrating multimodal sensors may enhance tracking accuracy, but in practice it is often confronted by a particular difficulty: different sensors may provide inconsistent or even conflicting information. Fusion that disregards such inconsistency leads to meaningless and wrong results. Detecting and handling inconsistency in fusion is still an open research problem.
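The report leaves the detection mechanism open. To illustrate what "inconsistency" means in practice, the sketch below uses a standard baseline, not the SSA mechanism: two independent Gaussian estimates of the same 2-D target position are compared with a chi-square test on their Mahalanobis distance, and they are fused (in information form) only when they agree. The function name `gate_and_fuse`, the sensor labels and the numbers are hypothetical.

```python
import numpy as np

def gate_and_fuse(x1, P1, x2, P2, confidence=0.95):
    """Chi-square gate for two independent 2-D Gaussian estimates of the same
    target position. Returns (mean, cov) of the fused estimate, or None if the
    estimates are statistically inconsistent and should not be fused."""
    diff = x1 - x2
    S = P1 + P2                                    # covariance of the difference
    d2 = float(diff @ np.linalg.solve(S, diff))    # squared Mahalanobis distance
    threshold = -2.0 * np.log(1.0 - confidence)    # chi-square quantile, valid for 2 dof only
    if d2 > threshold:
        return None                                # flag the conflict instead of fusing
    I1, I2 = np.linalg.inv(P1), np.linalg.inv(P2)  # information matrices
    P = np.linalg.inv(I1 + I2)
    return P @ (I1 @ x1 + I2 @ x2), P

# Hypothetical readings: (mean, covariance) from two sensors.
camera = (np.array([10.0, 5.0]), np.eye(2))
radar_ok = (np.array([10.5, 5.0]), np.eye(2))
radar_bad = (np.array([20.0, 5.0]), np.eye(2))

print(gate_and_fuse(*camera, *radar_ok)[0])  # fused mean: 10.25, 5.0
print(gate_and_fuse(*camera, *radar_bad))    # None
```

The closed-form threshold -2 ln(1 - p) holds only for two degrees of freedom; other dimensions would need a chi-square quantile table or library.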


4. Significance

The proposed research has significant potential impact on the Army in the following ways:

• In many military scenarios, surveillance is important both on and off the battlefield. Good surveillance systems provide soldiers with safe and secure environments. The proposed target tracking and acquisition methods are expected to greatly improve the performance of video-based surveillance systems;

• The Digital Soldier is a trend in future warfare, in which an individual soldier's sensing and communication abilities are greatly extended. One can envision such a Digital Soldier equipped with a powerful and intelligent video sensor that sees what the soldier cannot see or has no time to see. This third eye would be very significant for the soldier's survivability in action, and the proposed target tracking and acquisition method can be a critical and fundamental component of it;

• Military robots, such as combat robots, are becoming part of the battlefield. Some of them are autonomous and need powerful sensing capacity to move and act. For example, accurate and real-time targeting is important for saving a combat robot's ammunition. Such robots can also benefit from the proposed research;

• Urban warfare is challenging because the environment is complex. The proposed methods of persistent target acquisition and tracking can be quite useful in urban warfare. For example, distributed cameras can collaborate to locate and track a target in a local area, and such real-time information can be used for further action.


5. Accomplishments of Current Period

In the first period of the project (9/2008-2/2009), we mainly investigated the selective attention model, which incorporates four computational processes: early attentional selection, collaborative tracking, context-aware learning for late selection, and robust information fusion. Details can be found in the previous year's report.

54348-CI Project Summary Sheets 2013

In the second period of the project (2/2009-2/2010), we focused on context-aware tracking and robust multimodal fusion. Based on our new model of context flow, we developed a method that tracks general targets based on dense context and small targets based on sparse context.

In the third period of the project (2/2010-2/2011), we investigated how to embed discriminative learning in the late selection process, and how the higher-level selection process can influence the lower-level selection process.

In the fourth period of the project (2/2011-2/2012), we studied three important issues in the proposed SSA model. (1) The most critical issue in visual tracking is matching the target over time, and reliable matching depends on the choice of an appropriate similarity metric. We studied how the metric can be automatically learned and adapted for selective attention. (2) A practical problem in real applications is tracking small targets in low-resolution video. We proposed an innovative approach to track low-resolution targets without explicitly performing video super-resolution. (3) Combinations of small patterns can be more discriminative than the individual patterns. We studied a new approach to automatically discover disjunctive and conjunctive patterns from data. This pattern mining approach is one of the core components of the proposed SSA model.
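The report does not describe the mining algorithm itself. As a minimal, hypothetical illustration of why combinations can be more discriminative than their parts, the sketch below exhaustively scores single binary patterns and their pairwise conjunctions (AND) and disjunctions (OR) by the gap in firing rate between positive and negative samples. The scoring rule, the names `rate_gap` and `mine_pattern_pairs`, and the toy data are assumptions; brute-force enumeration is only practical for small pattern sets.

```python
import itertools
import numpy as np

def rate_gap(feature, labels):
    """Discriminative score: |P(fires | positive) - P(fires | negative)|."""
    return abs(feature[labels == 1].mean() - feature[labels == 0].mean())

def mine_pattern_pairs(F, labels, top_k=3):
    """Score single binary patterns and their pairwise AND / OR combinations.
    F is an (n_samples, n_patterns) boolean matrix; labels holds 0/1."""
    scored = [(rate_gap(F[:, j], labels), f"p{j}") for j in range(F.shape[1])]
    for a, b in itertools.combinations(range(F.shape[1]), 2):
        scored.append((rate_gap(F[:, a] & F[:, b], labels), f"p{a} AND p{b}"))  # conjunctive
        scored.append((rate_gap(F[:, a] | F[:, b], labels), f"p{a} OR p{b}"))   # disjunctive
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[:top_k]

# Toy data: p0 and p1 each fire on every positive but also on half of the negatives.
F = np.array([[1, 1, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1],    # positives
              [1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 1]],   # negatives
             dtype=bool)
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])

best_score, best_pattern = mine_pattern_pairs(F, labels, top_k=1)[0]
print(best_pattern, best_score)  # p0 AND p1 1.0
```

Each of p0 and p1 alone scores 0.5, while their conjunction separates the classes perfectly.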

In the last period of the project (2/2012-9/2012), we continued to investigate the important issues in online metric adjustment and selective attention identified in past years. Specifically, we studied (1) optimal order determination in metric learning, and (2) computational modeling of the unification of spatial selection and attribute selection.

5.1 Order Determination for the Optimal Metric

Our studies in previous years of this project demonstrated that (a) the matching metric is one of the most critical issues in target tracking, and a pre-defined metric is inadequate; (b) the matching metric needs to adapt to the tracking scene to better separate the target from the distracters; and (c) such a scene-aware metric can be learned and adapted along with target tracking, provided that good training data can be collected on the fly as the scene changes, e.g., based on prior knowledge. In the past several years, we have developed two innovative approaches to learning adaptive metrics for tracking.

We also found that the order of the metric plays an important role. Adjusting the matching metric can be regarded as projecting or mapping the original feature space to a new feature space (linearly or nonlinearly). The order of the metric is the dimension of this new feature space. If the dimension is too low, the learned metric may underfit the data; if it is too high, it generally overfits the data. Therefore, we study the important issue of order determination, i.e., finding the optimal dimension of the projected feature space. In the linear case, it is the rank of the Mahalanobis metric. The order is important to both the accuracy and the generalizability of the learned metric. However, determining it is a very challenging problem, because the order is a discrete structural parameter and the learning objective is not differentiable with respect to it.
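To make the notion of order concrete in the linear case: a Mahalanobis matrix M = LᵀL with L of size r × d gives d_M(x, y) = (x − y)ᵀM(x − y) = ‖L(x − y)‖², i.e., it is equivalent to projecting features to r dimensions and measuring Euclidean distance there, so the order is rank(M) = r. The sketch below checks this numerically with NumPy; the random data, the sizes d and r, and the function names are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a PSD matrix M."""
    diff = x - y
    return float(diff @ M @ diff)

def projected_sq(x, y, L):
    """Squared Euclidean distance after the linear projection z = L x."""
    diff = L @ (x - y)
    return float(diff @ diff)

rng = np.random.default_rng(0)
d, r = 6, 2                      # feature dimension d and metric order r (illustrative)
L = rng.standard_normal((r, d))  # hypothetical r x d projection matrix
M = L.T @ L                      # induced d x d Mahalanobis matrix of rank r

x, y = rng.standard_normal(d), rng.standard_normal(d)
print(np.isclose(mahalanobis_sq(x, y, M), projected_sq(x, y, L)))  # True
print(np.linalg.matrix_rank(M))                                      # 2
```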
We have found a novel solution to this structural order determination problem. The idea is to introduce a sparsity prior to regularize metric learning, with the aim of finding the lowest-rank projection that still achieves the best classification. The sparsity regularization serves this purpose and drives the solution toward the lowest possible dimension. It can be viewed as a minimum description length regularization for metric learning. We have validated this solution on standard benchmark datasets and demonstrated its effectiveness in visual target tracking tasks.
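The report does not state the exact sparsity regularizer or loss. One common way to encourage a low-rank Mahalanobis matrix, assumed here purely for illustration, is a trace-norm penalty λ·tr(M) on a positive semidefinite M (the trace norm is the usual convex surrogate for rank), optimized by proximal gradient: take a gradient step on a pairwise loss, then soft-threshold the eigenvalues and clip negatives to zero. The hinge loss, the function names, the toy data and the hyperparameters below are hypothetical.

```python
import numpy as np

def prox_trace_psd(M, tau):
    """Proximal operator of tau * trace(M) over PSD matrices:
    soft-threshold the eigenvalues by tau and clip negatives to zero."""
    w, V = np.linalg.eigh((M + M.T) / 2.0)  # symmetrize for numerical safety
    w = np.maximum(w - tau, 0.0)
    return (V * w) @ V.T                     # V diag(w) V^T

def learn_sparse_metric(X, similar, dissimilar, lam=0.1, lr=0.05, margin=1.0, iters=500):
    """Proximal gradient on a pairwise hinge loss plus lam * trace(M).
    `similar` and `dissimilar` are lists of index pairs (i, j) into X."""
    M = np.eye(X.shape[1])                   # start from the full-rank Euclidean metric
    for _ in range(iters):
        grad = np.zeros_like(M)
        for i, j in similar:                 # pull similar pairs together
            diff = X[i] - X[j]
            grad += np.outer(diff, diff)
        for i, j in dissimilar:              # push dissimilar pairs beyond the margin
            diff = X[i] - X[j]
            if diff @ M @ diff < margin:
                grad -= np.outer(diff, diff)
        M = prox_trace_psd(M - lr * grad, lr * lam)
    return M

# Toy data: only the first feature separates the two classes.
X = np.array([[-1.0, 0.1, 0.0], [-1.0, -0.1, 0.0],   # class A
              [1.0, 0.0, 0.1], [1.0, 0.0, -0.1]])    # class B
similar = [(0, 1), (2, 3)]
dissimilar = [(0, 2), (0, 3), (1, 2), (1, 3)]

M = learn_sparse_metric(X, similar, dissimilar)
print(np.linalg.matrix_rank(M))  # 1
```

In this toy setting the eigenvalues along the two noise directions are driven to zero, so the learned order is 1. A larger λ pushes harder toward low rank at the expense of fitting the pairwise constraints.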

5.2 Unification of Spatial Selection and Attribute Selection

One of the most challenging issues preventing video-based tracking from being reliable and useful in real applications is visual distracters. Visual distracters are image regions whose appearance is similar to the target's but that are actually false positives. They mislead the tracker and cause it to fail. Distracters are very common when tracking low-resolution targets and are almost universal in real application scenarios.

The resilience of a good tracker depends not only on the matching metric, but also on the characteristics of the target region to be matched. The two tasks, i.e., learning the best metric and selecting distracter-resilient target regions, correspond to the attribute selection and spatial selection processes in human visual perception. Spatial selection chooses spatially salient regions for matching, and attribute selection constructs good features for matching. Although these two topics have been studied extensively but separately in the literature, it is desirable to integrate both in a unified, coherent framework.

We have made an initial attempt to unify the modeling of these two tasks based on a new quantity called the soft visual margin. As a function of both the matching metric and the spatial location, it measures the discrimination between the target and its spatial distracters and characterizes the reliability of matching. We have obtained a novel method to jointly determine the best spatial location and the optimal metric. Based on it, we designed a distracter-resilient region tracker whose effectiveness has been validated through extensive experiments.
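The report defines the soft visual margin only informally. The hypothetical sketch below assumes (i) the margin is a smoothed (log-sum-exp) minimum of the distances from the target template to nearby distracters, minus the distance to the true match, under a Mahalanobis metric M, and (ii) joint selection is an exhaustive search over a small finite set of candidate locations and metrics. The definition, the names and the toy features are illustrative assumptions, not the project's formulation.

```python
import numpy as np

def soft_min(values, tau):
    """Smooth minimum: -tau * log(sum(exp(-v / tau))), computed stably."""
    v = np.asarray(values, dtype=float)
    m = v.min()
    return m - tau * np.log(np.sum(np.exp(-(v - m) / tau)))

def soft_visual_margin(template, match, distracters, M, tau=0.1):
    """Hypothetical soft margin under metric M: smoothed distance to the nearest
    distracter minus the distance to the true match (larger is more reliable)."""
    def dist(a, b):
        diff = a - b
        return float(diff @ M @ diff)
    return soft_min([dist(template, d) for d in distracters], tau) - dist(template, match)

def select_region_and_metric(regions, metrics, tau=0.1):
    """Exhaustively pick the (location, metric) pair with the largest soft margin.
    `regions` maps a location label to (template, match, distracters)."""
    best = None
    for loc, (template, match, distracters) in regions.items():
        for name, M in metrics.items():
            score = soft_visual_margin(template, match, distracters, M, tau)
            if best is None or score > best[0]:
                best = (score, loc, name)
    return best

# Toy 2-D features: the "center" region has a distracter almost identical to it.
regions = {
    "center": (np.array([0.0, 0.0]), np.array([0.05, 0.0]),
               [np.array([0.1, 0.0]), np.array([0.0, 1.0])]),
    "edge": (np.array([0.0, 0.0]), np.array([0.0, 0.05]),
             [np.array([1.0, 0.0]), np.array([0.0, 1.0])]),
}
metrics = {"identity": np.eye(2), "shrink_y": np.diag([1.0, 0.1])}

score, loc, name = select_region_and_metric(regions, metrics)
print(loc, name)  # edge identity
```

The temperature tau controls how strongly the margin is dominated by the single closest distracter; as tau approaches zero the soft minimum becomes the hard minimum.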

6. Collaborations and Leveraged Funding

This project leverages the research of an NSF project, “CAREER: Visual Analysis of High-Dimensional Motion: A Distributed Collaboration Approach,” 1/2004-2/2011. That NSF project focuses on estimating complex, high-dimensional motion, such as the articulated motion of the human body and the deformable motion of elastic shapes.


7. Conclusions

Persistent target acquisition and tracking in video is one of the key issues in video analysis and understanding, and thus a fundamental component of many emerging applications. Although video conveys rich information about the scene, the uncertainties in visual information induced by cluttered backgrounds and occlusion make this problem quite challenging. This project targets new and effective computational models and methods that bridge computational sensing and aspects of human perception of visual dynamics, especially selective attention and context-awareness. It aims to create a multi-level synergism to address difficulties including visual clutter and occlusions in video-based tracking, robustness and persistency, target re-acquisition and association in distributed video sensors, and information inconsistency in fusion. This project is expected to lead to a new computational model, a class of effective methods, and a set of powerful tools for video-based target tracking and acquisition.

8. Technology Transfer

We have not had technology transfer activities so far, but we plan to develop solid collaborations with researchers at ART as well as with industrial companies.


9. Future Research Plans

Although the project has concluded with a promising new computational model for persistent target tracking and a family of robust tracking methods, the research can be advanced further in the following directions:

• This computational model is motivated by findings in studies of human visual perception, and it attempts to give a plausible explanation of known but vaguely understood mechanisms in human visual perception. A very interesting and far-reaching direction is to investigate how this computational model can be used in visual cognition and psychological studies to predict hidden mechanisms of human visual cognition, for example, which aspects of selection occur in the early stage and which are learned in the later stage. This would help bridge the study of computational vision and human vision.

• Another direction is hardware implementation of these computational models and methods. Some components of the proposed SSA model, e.g., online visual pattern mining and online metric learning, demand more computational resources. Because they must be adapted and updated over time, it is difficult for our current software implementation to achieve real-time performance. It would be of great interest to parallelize these methods and implement them in hardware.


10. Anticipated Scientific Accomplishments

We anticipate the following outcomes from this research project:

• A new computational model for video-based target tracking and acquisition. This model incorporates the selective attention mechanism of human visual perception into the mathematical modeling; it will enable context-aware motion tracking and unify adaptation at various levels of visual information.

• Effective methods for target tracking and acquisition. The new computational model leads to a class of target tracking methods that automatically focus their computational attention on selected image regions and features and make use of contextual information from the environment. We expect this class of tracking methods to significantly outperform current state-of-the-art methods in robust and persistent long-duration target tracking.

• Powerful tools for video analysis and target tracking. In this project, a set of tools will be developed, including but not limited to contextual flow computation, salient attentional region determination, online mining of motion correlation, learning of dynamic random fields, and detection of information inconsistency. These tools are useful for a large variety of video analysis tasks.